Appendix: Performance Bounds for Policy-Based Average Reward Reinforcement Learning Algorithms

Neural Information Processing Systems

Thus the optimal average reward of the original MDP and the modified MDP differ by O(ϵ). To ensure Assumption 3.1(b) is satisfied, an aperiodicity transformation can be implemented. The proof of this theorem can be found in [Sch71]. From Lemma 2.2, we thus have (J …). In order to iterate Equation (8), we need to ensure the terms are non-negative. Theorem 3.3 presents an upper bound on the error in terms of the average reward.
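The aperiodicity transformation mentioned in the excerpt is classical; a minimal sketch of its standard lazy-chain form (the mixing parameter τ is our notation for illustration, not necessarily the paper's):

\[
    \tilde{P} \;=\; (1-\tau)\,I + \tau P, \qquad \tau \in (0,1),
\]

which gives every state a self-loop, so the chain \(\tilde{P}\) is aperiodic, while any stationary distribution \(\mu\) of \(P\) remains stationary:

\[
    \mu^{\top}\tilde{P} \;=\; (1-\tau)\,\mu^{\top} + \tau\,\mu^{\top}P \;=\; \mu^{\top}.
\]

With the reward left unchanged, the average reward \(J = \mu^{\top} r\) is therefore preserved, and only aperiodicity is gained.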



R-learning in actor-critic model offers a biologically relevant mechanism for sequential decision-making

Neural Information Processing Systems

A few studies have explored sequential stay-or-leave decisions in humans or in rodents, the model organism used to access neuronal activity at high resolution. In both cases, decision patterns were collected in foraging tasks, the experimental settings in which subjects decide when to leave a depleting resource (2).
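As a point of reference for the mechanism named in the title, a minimal sketch of an R-learning-style average-reward update inside a tabular actor-critic step is given below. This is a generic illustration, not the paper's model: the function name, learning rates alpha, beta, eta, and the tabular encoding are all assumptions.

import numpy as np

def r_learning_actor_critic_step(V, theta, rho, s, a, r, s_next,
                                 alpha=0.1, beta=0.01, eta=0.1):
    """One illustrative update of critic values V, actor preferences theta,
    and the running average-reward estimate rho (hypothetical sketch)."""
    delta = r - rho + V[s_next] - V[s]  # average-reward TD error
    V[s] += alpha * delta               # critic update
    rho += beta * delta                 # track the long-run reward rate
    theta[s, a] += eta * delta          # actor: reinforce action a in state s
    return rho

# Example with 3 states and 2 actions (hypothetical sizes):
V, theta, rho = np.zeros(3), np.zeros((3, 2)), 0.0
rho = r_learning_actor_critic_step(V, theta, rho, s=0, a=1, r=1.0, s_next=2)

The feature distinguishing R-learning from discounted TD is the subtraction of the running reward-rate estimate rho in the TD error, in place of a discount factor.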


A Proofs from Section 2

Neural Information Processing Systems

Return α̂. We show the following generalization of Proposition 2.1; moreover, Alg. 4 has sample complexity … The sample complexity is clear, so we focus on the first statement. Applying Theorem 4.5 in [MU17] to these events as i varies, noting that …, and recalling (A.2) above, we conclude that … The other direction is similar: using (A.2) in the same way as above, we find … First we analyze the expected sample complexity; finally, Alg. 4 has sample complexity … We do this using Bayes' rule.
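Assuming [MU17] refers to Mitzenmacher and Upfal's Probability and Computing (2nd ed.), the cited Theorem 4.5 is a multiplicative Chernoff lower-tail bound; for reference, the standard statement is:

\[
    \Pr\!\left[X \le (1-\delta)\mu\right] \;\le\; e^{-\mu\delta^{2}/2},
    \qquad 0 < \delta < 1,
\]

where \(X = \sum_{i=1}^{n} X_i\) is a sum of independent 0/1 random variables and \(\mu = \mathbb{E}[X]\). Applying such a bound to each event and taking a union over the index i is the standard way arguments of this shape proceed.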